Goto

Collaborating Authors

 gradient estimate




398475c83b47075e8897a083e97eb9f0-Supplemental.pdf

Neural Information Processing Systems

We revisit first-order optimization under local information constraints such as local privacy, gradient quantization, and computational constraints limiting access to a few coordinates of the gradient. In this setting, the optimization algorithm is not allowed to directly access the complete output of the gradient oracle, but only gets limited information about it subject to the local information constraints. We study the role of adaptivity in processing the gradient output to obtain this limited information from it. We consider optimization for both convex and strongly convex functions and obtain tight or nearly tight lower bounds for the convergence rate, when adaptive gradient processing is allowed. Prior work was restricted to convex functions and allowed only nonadaptive processing of gradients. For both of these function classes and for the three information constraints mentioned above, our lower bound implies that adaptive processing of gradients cannot outperform nonadaptive processing in most regimes of interest. We complement these results by exhibiting a natural optimization problem under information constraints for which adaptive processing of gradient strictly outperforms nonadaptive processing.



Optimistic Bandit Convex Optimization

Neural Information Processing Systems

We introduce the general and powerful scheme of predicting information re-use in optimization algorithms. This allows us to devise a computationally efficient algorithm for bandit convex optimization with new state-of-the-art guarantees for both Lipschitz loss functions and loss functions with Lipschitz gradients.


Reducing Reparameterization Gradient Variance

Neural Information Processing Systems

Optimization with noisy gradients has become ubiquitous in statistics and machine learning. Reparameterization gradients, or gradient estimates computed via the ``reparameterization trick,'' represent a class of noisy gradients often used in Monte Carlo variational inference (MCVI). However, when these gradient estimators are too noisy, the optimization procedure can be slow or fail to converge. One way to reduce noise is to generate more samples for the gradient estimate, but this can be computationally expensive. Instead, we view the noisy gradient as a random variable, and form an inexpensive approximation of the generating procedure for the gradient sample. This approximation has high correlation with the noisy gradient by construction, making it a useful control variate for variance reduction. We demonstrate our approach on a non-conjugate hierarchical model and a Bayesian neural net where our method attained orders of magnitude (20-2{,}000$\times$) reduction in gradient variance resulting in faster and more stable optimization.